The Art of Data Visualization: Detecting Multivariate Data Outliers Using an Interactive Approach

Authors

  • Diane Peers
  • Lorin Miller
Abstract

Successfully detecting outliers in multivariate data requires statistical and programming skills and can be very time-consuming. Requests for outlier detection can come from groups with different skill sets, so it is more efficient and effective to let users interact directly with the data themselves. We have developed an interactive, web-based data visualization application for outlier detection, built with R Shiny, that does not require specialist knowledge to use. The application reads in various file types, manipulates and reduces datasets as needed, performs an array of outlier detection methods, displays the outlier output through interactive graphs and downloadable tables, and provides tests to check any distributional assumptions. This paper visually demonstrates the functionality of the application, which includes exporting all or a subset of the outliers displayed and using machine learning techniques to better predict outliers based on prior decisions made by the user.

INTRODUCTION

An outlier is an observation that appears to deviate markedly from the other observations in the sample. Outlier detection is necessary for two key reasons: first, for data cleaning, so that potentially corrupt or inaccurate records can be identified; and second, to ensure that extreme values are handled appropriately so that they do not skew the analyses. Deciding which outlier detection method to use, and reviewing the validity of tests by checking distributional assumptions, can be tedious. Traditionally, a programmer or statistician is also needed to execute the chosen technique and produce any supporting visuals, which can be problematic for less technical professionals or for those unsure of the options and methods available. Detecting outliers is necessary, but it can be time- and labor-intensive.

We have streamlined this process with an interactive tool that leads to earlier detection of outliers and hence leaves more time to focus on the primary analysis. Our outlier application speeds up outlier detection tasks by accepting flexible data file types, letting the user manipulate the dataset as necessary, and offering an array of outlier techniques. It also checks any distributional assumptions and provides different techniques depending on the dimensionality of the data. All of the techniques provide supporting visuals before and after the outliers have been removed, to further assist decision making.
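As a rough illustration of what a multivariate outlier technique paired with a distributional assumption can look like, the sketch below flags bivariate points whose squared Mahalanobis distance exceeds a chi-square cutoff. This is an assumption chosen for illustration: the abstract does not state which methods the application implements, and the application itself is built in R Shiny rather than Python. The function names `mahalanobis_sq_2d` and `flag_outliers`, the 0.975 level, and the sample data are all hypothetical.

```python
import math


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point to the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form with the explicit inverse of the 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def flag_outliers(points, level=0.975):
    """Indices of points beyond the chi-square (2 d.f.) quantile at `level`.

    Assumes the bulk of the data is roughly bivariate normal. For 2 degrees
    of freedom the chi-square quantile has the closed form -2 * ln(1 - level).
    """
    cutoff = -2.0 * math.log(1.0 - level)
    return [i for i, d in enumerate(mahalanobis_sq_2d(points)) if d > cutoff]


# Hypothetical data: 13 points clustered around the origin plus one distant point.
sample = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]
sample += [(0.5, 0.5), (-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (8.0, 8.0)]
flagged = flag_outliers(sample)
```

Because the mean and covariance here are estimated from all points, including the extreme ones, several outliers can mask one another. Robust estimates of location and scatter are a common remedy, and that caveat is one reason a tool would offer more than one technique.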

Similar resources

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data are often modeled using a vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to incorrect modeling, biased parameter estimates and inaccurate predictions. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...

A Web-based Interactive Data Visualization System for Outlier Subspace Analysis

Detecting outliers in high-dimensional data is a challenging task since outliers mainly reside in various low-dimensional subspaces of the data. To tackle this challenge, a subspace-analysis-based outlier detection approach has been proposed recently. Detecting outlying subspaces in which a given data point is an outlier facilitates a better characterization process for detecting outliers for high...

Discussion of "multivariate functional outlier detection" by M. Hubert, P. Rousseeuw and P. Segaert

I would like to congratulate M. Hubert, P. Rousseeuw and P. Segaert for this stimulating and useful work on outlier detection methods for multivariate functional data. They define and classify rigorously different types of functional outliers and propose several techniques for detecting them in multivariate functional data. These authors use the notion of data depth and distances derived from t...

Outlier Detection in Wireless Sensor Networks Using Distributed Principal Component Analysis

Detecting anomalies is an important challenge for intrusion detection and fault diagnosis in wireless sensor networks (WSNs). To address the problem of outlier detection in wireless sensor networks, in this paper we present a PCA-based centralized approach and a DPCA-based distributed energy-efficient approach for detecting outliers in sensed data in a WSN. The outliers in sensed data can be ca...

Detecting multivariate outliers using projection pursuit with particle swarm optimization

Detecting outliers in the context of multivariate data is known as an important but difficult task and there already exist several detection methods. Most of the proposed methods are based either on the Mahalanobis distance of the observations to the center of the distribution or on a projection pursuit (PP) approach. In the present paper we focus on the one-dimensional PP approach which may be...

Publication date: 2017